1 Overview

This session will give hands-on feel on conducting a bioinformatics workshop.

In the next 2 hours, we will cover the following secions ;

  1. Introduction to Binformatics
  2. Introduction to R
  3. RNA-sequencing analysis

2 The question that we will answer

Cancer is a disease in which some of the body’s cells grow uncontrollably and spread to other parts of the body. Cancer is caused by certain changes to genes.

What are the differentially expressed genes between normal cells and cancer cells?


3 Introduction to Cancer Bioinformatics

FIGURE 1 - Overarching concept of cancer bioinformatics. Figuring out the difference between two systems.
FIGURE 1 - Overarching concept of cancer bioinformatics. Figuring out the difference between two systems.
FIGURE 2 - Cancer is a genetic disease
FIGURE 2 - Cancer is a genetic disease
  • Cancer is caused by changes to genes that control the way our cell function.
  • There are about 20,000 genes in the human genome.
  • Image taken from National Cancer Institute

4 Introduction to R and RStudio

What is R

  • R is an open-source language and environment for statistical computing and graphics, widely used by scientists.
  • R is both a computational language and environment for statistical computing, data visualization, data science and machine learning
  • RStudio is an integrated development environment for R and Python
  • Rstudio provides a graphic user interface for working with R
  • In this session, we will showcase an cloud based RStudio Server -
  • User can install R and Rstudio locally on their device

Introduction to RStudio interface

  • Panel towards the top left is the scrip
  • Basic math function

Addition

3 + 3
## [1] 6

Multiplication

3 * 3
## [1] 9

Storing variables in R

num1 <- 5
num2 = 10

num1 + num2
## [1] 15

A more practical example. Lets create a vector storing multiple values

#create vectors to hold plant heights from each sample
group1 <- c(8, 8, 9, 9, 9, 11, 12, 13, 13, 14)
group2 <- c(22, 23, 24, 24, 25, 26, 27, 20, 26, 28)

Lets get the sum

8 + 8 + 9 + 9 + 9 + 11 + 12 + 13 + 13 + 14
## [1] 106

Now calculate the mean

(8 + 8 + 9 + 9 + 9 + 11 + 12 + 13 + 13 + 14) / 10
## [1] 10.6

Use built in function in R

Get the sum

sum(group1)
## [1] 106

Get the mean

mean(group1)
## [1] 10.6

4.1 Visuzalize data

# Calculate means
means <- c(mean(group1), mean(group2))

# Make bar plot
barplot(means, names.arg = c("Group 1", "Group 2"),
        col = c("skyblue", "salmon"),
        main = "Average Plant Height",
        ylab = "Mean Height")

4.2 Statistical test

A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether two groups are different from one another.

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

where:
- \(\bar{x}_1, \bar{x}_2\) are the sample means
- \(s_1^2, s_2^2\) are the sample variances
- \(n_1, n_2\) are the sample sizes

Variance is simply the spread of the data

Click to show/hide code to generate the plot
library(ggplot2)
df <- data.frame(
  value = c(group1, group2),
  group = c(rep("group1", length(group1)), rep("group2", length(group2)))
)

# Calculate means and variances
means <- tapply(df$value, df$group, mean)
vars <- tapply(df$value, df$group, var)

# Data for variance boxes
rects <- data.frame(
  group = c("group1", "group2"),
  xmin = c(0.7, 1.7),
  xmax = c(1.3, 2.3),
  ymin = means - vars/2,
  ymax = means + vars/2,
  mean = means,
  var = vars
)
ggplot(df, aes(x = group, y = value, color = group)) +
  geom_jitter(width = 0.1, size = 3, alpha = 0.7) +
  # Variance box
  geom_rect(data = rects, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), 
            fill = NA, color = "red", linetype = "dashed", inherit.aes = FALSE) +
  # Variance annotation
  geom_text(data = rects, aes(x = group, y = ymax + 10, 
                              label = paste0("Variance = ", round(var, 2))), 
            color = "red", size = 4, inherit.aes = FALSE) +
  # Mean point
  geom_point(data = rects, aes(x = group, y = mean), color = "black", size = 4, inherit.aes = FALSE) +
  labs(title = "Variance Visualized as a Box",
       y = "Value") +
  theme_bw()


The T-test is performed using the t.test() function, which essentially tests for the difference in means of a variable between two groups.

t.test(group1, group2)
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = -13.26, df = 17.932, p-value = 1.048e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.10295 -11.69705
## sample estimates:
## mean of x mean of y 
##      10.6      24.5

t.test saves a lot of information: the difference in means estimate, confidence interval for the difference conf.int, the p-value p.value, etc.


Without using the build in function, it will look something like this ;

# Data
group1 <- c(8, 8, 9, 9, 9, 11, 12, 13, 13, 14)
group2 <- c(22, 23, 24, 24, 25, 26, 27, 20, 26, 28)

# Calculate means
mean1 <- mean(group1)
mean2 <- mean(group2)

# Calculate sample variances
var1 <- var(group1)  # s1^2
var2 <- var(group2)  # s2^2

# Sample sizes
n1 <- length(group1)
n2 <- length(group2)

# Standard error using variances
se <- sqrt((var1 / n1) + (var2 / n2))

# t-statistic
t_stat <- (mean1 - mean2) / se

# Degrees of freedom (Welch's)
df <- ( (var1/n1 + var2/n2)^2 ) / 
      ( ((var1/n1)^2)/(n1-1) + ((var2/n2)^2)/(n2-1) )

# p-value
p_value <- 2 * pt(-abs(t_stat), df)

# Print results
cat("p-value:", p_value, "\n")
## p-value: 1.048483e-10

5 RNA-sequencing analysis

FIGURE 3 - Example of gene expression change
FIGURE 3 - Example of gene expression change
  • The example shows the expression level of one gene.
  • How can we repeat the analysis for 20,000 genes?
  • Where does the gene expression data comes from? Sequencing

5.1 Example data

  • 20 patients
    • 10 normal
    • 10 cancer
  • About 20,000 genes

5.2 Load files in R

Lets load in the actual data from an experiment comparing a tumor and normal RNA-seq experiment.

First we read the in the sample data and look at the files.

df.meta <- read.csv("../output/sample_data.csv")
df.meta
##          Run  group
## 1  SRR975580 normal
## 2  SRR975577 normal
## 3  SRR975575 normal
## 4  SRR975584 normal
## 5  SRR975570 normal
## 6  SRR975579 normal
## 7  SRR975585 normal
## 8  SRR975583 normal
## 9  SRR975586 normal
## 10 SRR975573 normal
## 11 SRR975581 normal
## 12 SRR975574 normal
## 13 SRR975582 normal
## 14 SRR975571 normal
## 15 SRR975576 normal
## 16 SRR975560  tumor
## 17 SRR975557  tumor
## 18 SRR975554  tumor
## 19 SRR975563  tumor
## 20 SRR975559  tumor
## 21 SRR975555  tumor
## 22 SRR975552  tumor
## 23 SRR975567  tumor
## 24 SRR975564  tumor
## 25 SRR975551  tumor
## 26 SRR975566  tumor
## 27 SRR975556  tumor
## 28 SRR975558  tumor
## 29 SRR975568  tumor
## 30 SRR975565  tumor

Now we read into R the RNA-seq data. Lets look at the data first 3 rows and 5 columns. Each row is a gene and each column is a sample.

raw.counts <- read.csv("../output/raw_counts.csv", row.names = 1)
raw.counts[1:3, 1:5]
##        SRR975580 SRR975577 SRR975575 SRR975584 SRR975570
## TSPAN6      2704      1309       906       981       691
## TNMD          21        19         5        10         4
## DPM1         783       389       406       419       278

What are these numbers? - They are the number of reads that mapped to each gene.

FIGURE 4 - Example of distibution of reads across genes.
FIGURE 4 - Example of distibution of reads across genes.

5.3

Lets now perfrom

6 References